Course: Math 535 - Mathematical Methods in Data Science (MMiDS)
Chapter: 4-Spectral graph theory
Author: Sebastien Roch, Department of Mathematics, University of Wisconsin-Madison
Updated: May 30, 2024
Copyright: © 2024 Sebastien Roch


Motivating example: uncovering social groups¶

In this chapter, we analyze datasets in the form of networks. As motivation, we first look at the Karate Club dataset.

From Wikipedia:

A social network of a karate club was studied by Wayne W. Zachary for a period of three years from 1970 to 1972. The network captures 34 members of a karate club, documenting links between pairs of members who interacted outside the club. During the study a conflict arose between the administrator "John A" and instructor "Mr. Hi" (pseudonyms), which led to the split of the club into two. Half of the members formed a new club around Mr. Hi; members from the other part found a new instructor or gave up karate. Based on collected data Zachary correctly assigned all but one member of the club to the groups they actually joined after the split.

Figure: Karate Club network (Source)


$\bowtie$

We use the NetworkX package to load the data and visualize it. We will say more about this package later in the chapter. In the meantime, there is a good tutorial here.

In [3]:
import networkx as nx
In [4]:
G = nx.karate_club_graph()
nx.draw_networkx(G)

Our goal:

identify natural sub-groups in the network

That is, we want to find groups of nodes that have many links between them, but relatively few with the other nodes.

It will turn out that the eigenvectors of the Laplacian matrix, a matrix naturally associated to the graph, contain useful information about such communities.

Background: basic concepts in graph theory¶

NUMERICAL CORNER: In Python, the NetworkX package provides many functionalities for defining, modifying and plotting graphs. For instance, many standard graphs can be defined conveniently. The petersen_graph() function defines the Petersen graph.

In [5]:
import networkx as nx
In [6]:
G = nx.petersen_graph()

This graph can be plotted using the function draw_networkx().

In [7]:
nx.draw_networkx(G, node_size=600, node_color='black', font_size=16, font_color='white')

Other standard graphs can be generated with special functions, e.g. complete graphs using complete_graph(). See here for a complete list.

In [8]:
G = nx.complete_graph(3)
In [9]:
nx.draw_networkx(G, node_size=600, node_color='black', font_size=16, font_color='white')

See here and here for a list of functions to access various properties of a graph. Here are a few examples:

In [10]:
G = nx.path_graph(10)
In [11]:
nx.draw_networkx(G, node_size=600, node_color='black', font_size=16, font_color='white')
In [12]:
G.number_of_nodes() # number of nodes
Out[12]:
10
In [13]:
G.number_of_edges() # number of edges
Out[13]:
9
In [16]:
G.has_edge(0, 1) # checks whether the graph has a particular edge
Out[16]:
True
In [17]:
G.has_edge(0, 2)
Out[17]:
False
In [18]:
[n for n in G.neighbors(2)] # returns a list of neighbors of the specified vertex
Out[18]:
[1, 3]
In [19]:
nx.is_connected(G) # checks whether the graph is connected
Out[19]:
True
In [20]:
[cc for cc in nx.connected_components(G)] # returns the connected components
Out[20]:
[{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}]
In [21]:
for e in G.edges():
    print(e)
(0, 1)
(1, 2)
(2, 3)
(3, 4)
(4, 5)
(5, 6)
(6, 7)
(7, 8)
(8, 9)

Another way of specifying a graph is to start with an empty graph with a given number of vertices and then add edges one by one. The following command creates a graph with $4$ vertices and no edges (see empty_graph()).

In [22]:
G = nx.empty_graph(4)
In [23]:
G.add_edge(0, 1)
G.add_edge(2, 3)
G.add_edge(0, 3)
G.add_edge(3, 0)
In [24]:
nx.draw_networkx(G, node_size=600, node_color='black', font_size=16, font_color='white')

NUMERICAL CORNER: The package NetworkX also supports digraphs.

In [25]:
G = nx.DiGraph()
nx.add_star(G, [0, 1, 2, 3, 4])
In [26]:
nx.draw_networkx(G, node_size=600, node_color='black', font_size=16, font_color='white')

Another way of specifying a digraph is to start with an empty digraph and then add edges one by one (compare to the undirected case above). The following command creates a digraph with no vertices; vertices are added automatically as edges are inserted.

In [27]:
G = nx.DiGraph()
In [28]:
G.add_edge(0, 1)
G.add_edge(2, 3)
G.add_edge(0, 3)
G.add_edge(3, 0)
G.add_edge(1,1)
In [29]:
nx.draw_networkx(G, node_size=600, node_color='black', font_size=16, font_color='white')

$\unlhd$

NUMERICAL CORNER: Using NetworkX, the adjacency matrix of a graph can be obtained with adjacency_matrix(). By default, it returns a SciPy sparse matrix. Alternatively, one can get a regular array with toarray(). Recall that in NumPy (and SciPy) array indices start at $0$. Consistently, NetworkX also names vertices starting at $0$. Note, however, that this conflicts with our mathematical conventions.

In [30]:
G = nx.complete_graph(3)
In [31]:
A = nx.adjacency_matrix(G)
print(A)
  (0, 1)	1
  (0, 2)	1
  (1, 0)	1
  (1, 2)	1
  (2, 0)	1
  (2, 1)	1
In [32]:
A = nx.adjacency_matrix(G).toarray()
print(A)
[[0 1 1]
 [1 0 1]
 [1 1 0]]
In [33]:
G = nx.petersen_graph()
A = nx.adjacency_matrix(G)
print(A)
  (0, 1)	1
  (0, 4)	1
  (0, 5)	1
  (1, 0)	1
  (1, 2)	1
  (1, 6)	1
  (2, 1)	1
  (2, 3)	1
  (2, 7)	1
  (3, 2)	1
  (3, 4)	1
  (3, 8)	1
  (4, 0)	1
  (4, 3)	1
  (4, 9)	1
  (5, 0)	1
  (5, 7)	1
  (5, 8)	1
  (6, 1)	1
  (6, 8)	1
  (6, 9)	1
  (7, 2)	1
  (7, 5)	1
  (7, 9)	1
  (8, 3)	1
  (8, 5)	1
  (8, 6)	1
  (9, 4)	1
  (9, 6)	1
  (9, 7)	1

The incidence matrix is obtained with incidence_matrix() -- again as a sparse array.

In [34]:
B = nx.incidence_matrix(G)
print(B)
  (0, 0)	1.0
  (1, 0)	1.0
  (0, 1)	1.0
  (4, 1)	1.0
  (0, 2)	1.0
  (5, 2)	1.0
  (1, 3)	1.0
  (2, 3)	1.0
  (1, 4)	1.0
  (6, 4)	1.0
  (2, 5)	1.0
  (3, 5)	1.0
  (2, 6)	1.0
  (7, 6)	1.0
  (3, 7)	1.0
  (4, 7)	1.0
  (3, 8)	1.0
  (8, 8)	1.0
  (4, 9)	1.0
  (9, 9)	1.0
  (5, 10)	1.0
  (7, 10)	1.0
  (5, 11)	1.0
  (8, 11)	1.0
  (6, 12)	1.0
  (8, 12)	1.0
  (6, 13)	1.0
  (9, 13)	1.0
  (7, 14)	1.0
  (9, 14)	1.0
In [35]:
B = nx.incidence_matrix(G).toarray()
print(B)
[[1. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [1. 0. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 1. 0. 1. 1. 0. 0. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0. 0. 0. 1. 0. 1. 0. 0. 0. 0. 0.]
 [0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0. 0. 0.]
 [0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 1. 1. 0.]
 [0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 0. 0. 0. 1.]
 [0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 1. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 1. 1.]]

In the digraph case, the definitions are adapted as follows. The adjacency matrix $A$ of a digraph $G = (V, E)$ is the matrix defined as

\begin{align*} A_{xy} = \begin{cases} 1 & \text{if $(x,y) \in E$}\\ 0 & \text{o.w.} \end{cases} \end{align*}

The incidence matrix of a digraph $G$ with vertices $1,\ldots,n$ and edges $e_1, \ldots, e_m$ is the matrix $B$ such that $B_{ij} = -1$ if edge $e_j$ leaves vertex $i$, $B_{ij} = 1$ if edge $e_j$ enters vertex $i$, and $0$ otherwise.

NUMERICAL CORNER: We revisit an earlier directed graph.

In [36]:
G = nx.DiGraph()
In [37]:
G.add_edge(0, 1)
G.add_edge(2, 3)
G.add_edge(0, 3)
G.add_edge(3, 0)
G.add_edge(1,1)

We compute the adjacency and incidence matrices. For the incidence matrix, one must specify oriented=True for the oriented version.

In [38]:
A = nx.adjacency_matrix(G).toarray()
print(A)
[[0 1 0 1]
 [0 1 0 0]
 [0 0 0 1]
 [1 0 0 0]]
In [39]:
B = nx.incidence_matrix(G, oriented=True).toarray()
print(B)
[[-1. -1.  0.  0.  1.]
 [ 1.  0.  0.  0.  0.]
 [ 0.  0.  0. -1.  0.]
 [ 0.  1.  0.  1. -1.]]

Revisiting an earlier undirected graph, we note that incidence_matrix() can also produce an oriented incidence matrix (with an arbitrary orientation of each edge) by using the oriented=True option.

In [40]:
G = nx.empty_graph(4)
In [41]:
G.add_edge(0, 1)
G.add_edge(2, 3)
G.add_edge(0, 3)
G.add_edge(3, 0)
In [42]:
B = nx.incidence_matrix(G, oriented=True).toarray()
print(B)
[[-1. -1.  0.]
 [ 1.  0.  0.]
 [ 0.  0. -1.]
 [ 0.  1.  1.]]

Spectral properties of the Laplacian matrix¶

NUMERICAL CORNER: One use of the spectral decomposition of the Laplacian matrix is in graph drawing. We illustrate this next. Given a graph $G = (V, E)$, it is not clear a priori how to draw it in the plane, since the only information available is the adjacency structure of the vertices. One approach is simply to position the vertices at random. The functions networkx.draw() and networkx.draw_networkx() can take as input different graph layout functions, which return an $x$- and $y$-coordinate for each vertex.

We will test this on a grid graph. Sometimes a picture is worth a thousand words. This is an example of a $4 \times 7$-grid graph.

Figure: Grid graph (Source)


$\bowtie$

We use grid_2d_graph() to construct such a graph.

In [43]:
G = nx.grid_2d_graph(4,7)

One layout approach is to choose random locations for the nodes. Specifically, for every node, a position is generated by choosing each coordinate uniformly at random on the interval $[0,1]$.

In [44]:
nx.draw(G, pos=nx.random_layout(G))

Clearly, this is a lot harder to read than the original graph above.

Another approach is to map the vertices to two eigenvectors, similarly to what we did for dimensionality reduction. The eigenvector associated to $\mu_1$ is constant and therefore not useful for drawing. We try the next two. We use the Laplacian matrix.

In [45]:
nx.draw(G, pos=nx.spectral_layout(G))

Interestingly, the outcome is very similar to the original, more natural drawing. We will come back later to try to explain this, after we have developed further understanding of the spectral properties of the Laplacian matrix.
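The basic idea behind spectral_layout can be sketched directly: use the eigenvectors associated with the second and third smallest Laplacian eigenvalues as coordinates. The sketch below illustrates this on the grid graph; it is not NetworkX's exact implementation, which also rescales and centers the layout.

```python
import networkx as nx
import numpy as np
from numpy import linalg as LA

G = nx.grid_2d_graph(4, 7)
L = nx.laplacian_matrix(G).toarray().astype(float)
w, v = LA.eigh(L)  # eigenvalues (and eigenvectors) in increasing order

# Skip the first (constant) eigenvector; use the next two as coordinates
coords = v[:, 1:3]
pos = {node: coords[i] for i, node in enumerate(G.nodes())}
nx.draw(G, pos=pos)
```

The resulting picture closely resembles the output of nx.spectral_layout above.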

NUMERICAL CORNER: We construct a graph with two connected components and check the results above. We work directly with the adjacency matrix.

In [46]:
A = np.array([[0, 1, 1, 0, 0], 
              [1, 0, 1, 0, 0], 
              [1, 1, 0, 0, 0], 
              [0, 0, 0, 0, 1], 
              [0, 0, 0, 1, 0]])
print(A)
[[0 1 1 0 0]
 [1 0 1 0 0]
 [1 1 0 0 0]
 [0 0 0 0 1]
 [0 0 0 1 0]]

Note the block structure.

The degrees can be obtained by summing the rows of the adjacency matrix.

In [47]:
degrees = A.sum(axis=1)
print(degrees)
[2 2 2 1 1]
In [48]:
D = np.diag(degrees)
print(D)
[[2 0 0 0 0]
 [0 2 0 0 0]
 [0 0 2 0 0]
 [0 0 0 1 0]
 [0 0 0 0 1]]
In [49]:
L = D - A
print(L)
[[ 2 -1 -1  0  0]
 [-1  2 -1  0  0]
 [-1 -1  2  0  0]
 [ 0  0  0  1 -1]
 [ 0  0  0 -1  1]]
In [50]:
LA.eigvals(L)
Out[50]:
array([ 3.00000000e+00, -3.77809194e-16,  3.00000000e+00,  2.00000000e+00,
        0.00000000e+00])

Observe that (up to numerical error) there are two $0$ eigenvalues, one for each connected component, and that the largest eigenvalue is greater than or equal to the maximum degree plus one.
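The correspondence between zero eigenvalues and connected components can be checked on further examples. Here is a small self-contained sketch; the helper function num_zero_eigenvalues is ours, introduced for illustration.

```python
import networkx as nx
import numpy as np
from numpy import linalg as LA

def num_zero_eigenvalues(G, tol=1e-8):
    # Count Laplacian eigenvalues that vanish up to numerical error
    L = nx.laplacian_matrix(G).toarray().astype(float)
    return int(np.sum(np.abs(LA.eigvalsh(L)) < tol))

G = nx.path_graph(5)
G.add_nodes_from(range(5, 8))  # three isolated vertices: 4 components total
print(num_zero_eigenvalues(G))            # 4
print(nx.number_connected_components(G))  # 4
```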

To compute the Laplacian matrix, one can also use the function laplacian_matrix(). For example, the Laplacian of the Petersen graph is the following:

In [51]:
G = nx.petersen_graph()
L = nx.laplacian_matrix(G).toarray()
print(L)
[[ 3 -1  0  0 -1 -1  0  0  0  0]
 [-1  3 -1  0  0  0 -1  0  0  0]
 [ 0 -1  3 -1  0  0  0 -1  0  0]
 [ 0  0 -1  3 -1  0  0  0 -1  0]
 [-1  0  0 -1  3  0  0  0  0 -1]
 [-1  0  0  0  0  3  0 -1 -1  0]
 [ 0 -1  0  0  0  0  3  0 -1 -1]
 [ 0  0 -1  0  0 -1  0  3  0 -1]
 [ 0  0  0 -1  0 -1 -1  0  3  0]
 [ 0  0  0  0 -1  0 -1 -1  0  3]]
In [52]:
LA.eigvals(L)
Out[52]:
array([ 5.00000000e+00,  2.00000000e+00, -2.80861083e-17,  5.00000000e+00,
        5.00000000e+00,  2.00000000e+00,  2.00000000e+00,  5.00000000e+00,
        2.00000000e+00,  2.00000000e+00])

NUMERICAL CORNER: This is perhaps easiest to see on a path graph. Note: NetworkX numbers vertices $0,\ldots,n-1$.

In [53]:
G = nx.path_graph(10)
In [54]:
nx.draw_networkx(G, 
                 node_size=600, node_color='black',
                 font_size=16, font_color='white', 
                 pos=nx.random_layout(G, seed=535)
                )

We plot the second Laplacian eigenvector (i.e., the eigenvector of the Laplacian matrix corresponding to the second smallest eigenvalue). We use numpy.argsort() to find the index of the second smallest eigenvalue. Because indices start at 0, we want entry 1 of the output of np.argsort().

In [55]:
L = nx.laplacian_matrix(G).toarray()
w, v = LA.eigh(L)
y2 = v[:,np.argsort(w)[1]]
In [56]:
plt.plot(y2)
plt.show()

Application to graph partitioning via spectral clustering¶

In [57]:
G_tree = nx.random_tree(n=6, seed=111)
In [58]:
nx.draw(G_tree, pos=nx.circular_layout(G_tree), with_labels=True,
                 node_size=200, node_color="black", 
                 font_size=10, font_color="white")

NUMERICAL CORNER: (A random tree) We return to the random tree example above. We claimed that $\phi_G = 1/3$. The maximum degree is $\bar{\delta} = 3$.

In [59]:
phi_G = 1/3
max_deg = 3

We now compute $\mu_2$. We first compute the Laplacian matrix.

In [60]:
L_tree = nx.laplacian_matrix(G_tree).toarray()
print(L_tree)
[[ 1 -1  0  0  0  0]
 [-1  3  0 -1  0 -1]
 [ 0  0  2 -1 -1  0]
 [ 0 -1 -1  2  0  0]
 [ 0  0 -1  0  1  0]
 [ 0 -1  0  0  0  1]]
In [61]:
w, v = LA.eigh(L_tree) 
mu_2 = np.sort(w)[1]
print(mu_2)
0.32486912943335317

We check Cheeger's inequality. The left-hand side is:

In [62]:
(phi_G ** 2) / (2 * max_deg)
Out[62]:
0.018518518518518517

The right-hand side is:

In [63]:
2 * phi_G
Out[63]:
0.6666666666666666
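Putting the three quantities together, we can check both sides of the inequality in one self-contained cell; the Laplacian below is copied from the printed output above.

```python
import numpy as np
from numpy import linalg as LA

# Laplacian of the random tree above, copied from the printed output
L_tree = np.array([[ 1, -1,  0,  0,  0,  0],
                   [-1,  3,  0, -1,  0, -1],
                   [ 0,  0,  2, -1, -1,  0],
                   [ 0, -1, -1,  2,  0,  0],
                   [ 0,  0, -1,  0,  1,  0],
                   [ 0, -1,  0,  0,  0,  1]])
phi_G, max_deg = 1/3, 3
mu_2 = np.sort(LA.eigvalsh(L_tree))[1]

# Cheeger: phi^2 / (2 * max_deg) <= mu_2 <= 2 * phi
print((phi_G**2) / (2 * max_deg) <= mu_2 <= 2 * phi_G)  # True
```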

NUMERICAL CORNER: We implement the graph cutting algorithm above in Python. We first write an auxiliary function that takes as input an adjacency matrix, an ordering of the vertices, and a value $k$. It returns the cut ratio for the first $k$ vertices in the order.

In [64]:
def cut_ratio(A, order, k):
    
    n = A.shape[0] # number of vertices
    edge_boundary = 0 # initialize size of edge boundary 
    for i in range(k+1): # for all vertices before cut
        for j in range(k+1,n): # for all vertices after cut
            edge_boundary += A[order[i],order[j]] # add one if {i,j} in E
            
    denominator = np.minimum(k+1, n-k-1)
    
    return edge_boundary/denominator
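To build intuition, consider the path graph with its vertices in their natural left-to-right order (the function is repeated here so the cell is self-contained): every prefix cut severs exactly one edge, so the cut ratio is smallest for the most balanced cut, in the middle.

```python
import numpy as np
import networkx as nx

def cut_ratio(A, order, k):
    # Cut ratio of the cut (order[0..k], order[k+1..n-1])
    n = A.shape[0]
    edge_boundary = 0
    for i in range(k+1):
        for j in range(k+1, n):
            edge_boundary += A[order[i], order[j]]
    return edge_boundary / np.minimum(k+1, n-k-1)

n = 10
A = nx.adjacency_matrix(nx.path_graph(n)).toarray()
order = list(range(n))  # natural left-to-right order
ratios = [cut_ratio(A, order, k) for k in range(n-1)]
print(ratios)  # minimized at k = 4, the middle cut
```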

Using the cut_ratio function, we first compute the Laplacian, find the second eigenvector, and order the vertices according to its entries. Then we compute the cut ratio for every $k$. Finally, we output the cut (both $S_k$ and $S_k^c$) corresponding to the minimum, as a tuple of arrays.

In [65]:
def spectral_cut2(A):
    n = A.shape[0] # number of vertices
    
    # laplacian
    degrees = A.sum(axis=1)
    D = np.diag(degrees)
    L = D - A

    # spectral decomposition
    w, v = LA.eigh(L) 
    order = np.argsort(v[:,np.argsort(w)[1]]) # index of entries in increasing order
    
    # cut ratios
    phi = np.zeros(n-1) # initialize cut ratios
    for k in range(n-1):
        phi[k] = cut_ratio(A, order, k)
    imin = np.argmin(phi) # find best cut ratio
    
    return order[0:imin+1], order[imin+1:n]

We will illustrate this on the path graph.

In [66]:
n = 10
G = nx.path_graph(n)
In [67]:
nx.draw(G, node_size=200, node_color='black', with_labels=True,
                 font_size=10, font_color='white',
                 pos=nx.spectral_layout(G))

We apply our spectral-based cutting algorithm.

In [68]:
A = nx.adjacency_matrix(G).toarray()
s, sc = spectral_cut2(A)
print(s)
print(sc)
[0 1 2 3 4]
[5 6 7 8 9]

To help with visualizing the output, we write a function coloring the vertices according to which side of the cut they are on.

In [69]:
def viz_cut(G, s, pos):
    n = G.number_of_nodes()
    assign = np.ones(n)
    assign[s] = 2
    nx.draw(G, node_color=assign, pos=pos, with_labels=False, cmap=plt.cm.brg,
       node_size=200, font_size=10, font_color='white')
In [70]:
pos = nx.spectral_layout(G)
viz_cut(G, s, pos)

Let's try it on the grid graph. Can you guess what the cut will be?

In [71]:
G = nx.grid_2d_graph(4,7)
A = nx.adjacency_matrix(G).toarray()
s, sc = spectral_cut2(A)
pos = nx.spectral_layout(G)
viz_cut(G, s, pos)

Back to community detection¶

We return to the Karate Club dataset.

In [72]:
G = nx.karate_club_graph()
n = G.number_of_nodes()
A = nx.adjacency_matrix(G).toarray()
In [73]:
nx.draw(G,  with_labels=True, 
        node_size=200, font_size=10, font_color='white', node_color='black')

We seek to find natural sub-communities. We use the spectral properties of the Laplacian as described in the lectures.

We use our spectral_cut2 and viz_cut functions to compute a good cut and visualize it.

In [74]:
s, sc = spectral_cut2(A)
print(s)
print(sc)
[18 26 20 14 29 22 24 15 23 25 32 27  9 33 31 28 30  8]
[ 2 13  1 19  7  3 12  0 21 17 11  4 10  6  5 16]
In [75]:
pos = nx.spring_layout(G)
viz_cut(G, s, pos)

It is not trivial to assess the quality of the resulting cut. But this particular example has a known ground-truth community structure (which partly explains its widespread use). Quoting from Wikipedia:

A social network of a karate club was studied by Wayne W. Zachary for a period of three years from 1970 to 1972. The network captures 34 members of a karate club, documenting links between pairs of members who interacted outside the club. During the study a conflict arose between the administrator "John A" and instructor "Mr. Hi" (pseudonyms), which led to the split of the club into two. Half of the members formed a new club around Mr. Hi; members from the other part found a new instructor or gave up karate. Based on collected data Zachary correctly assigned all but one member of the club to the groups they actually joined after the split.

This ground truth is the following.

In [76]:
truth = np.array([2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1, 
    1, 2, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
nx.draw(G, node_color=truth, pos=pos,  with_labels=True, cmap=plt.cm.rainbow,
       node_size=200, font_size=10, font_color='white')

You can check that our cut perfectly matches the ground truth.
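Here is one way to perform that check directly; the cut s below is copied from the printed output of spectral_cut2 above.

```python
import numpy as np

truth = np.array([2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 2, 2, 2, 2, 1,
    1, 2, 2, 1, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
# Cut found by spectral_cut2, copied from its printed output above
s = [18, 26, 20, 14, 29, 22, 24, 15, 23, 25, 32, 27, 9, 33, 31, 28, 30, 8]
print(set(s) == set(np.flatnonzero(truth == 1)))  # True
```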

NUMERICAL CORNER: We implement the generation of an inhomogeneous ER graph using NetworkX. We first initialize a pseudorandom number generator rng. To determine whether an edge is present between i and j, we generate a uniform random variable rng.random() (see numpy.random.Generator.random) and add the edge with G.add_edge(i, j) if the random variable is < M[i, j] -- an event which indeed occurs with the desired probability (check it!).

In [77]:
seed = 535
rng = np.random.default_rng(seed)
In [78]:
def inhomogeneous_er_random_graph(n, M, rng):

    # Initialize an empty graph with n vertices
    G = nx.Graph()
    G.add_nodes_from(range(n))
    
    # Add each edge {i, j} independently with probability M[i, j]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < M[i, j]:
                G.add_edge(i, j)

    return G

Here is an example usage. We generate probabilities $m_{i,j}$ uniformly at random between $0$ and $1$.

In [79]:
n = 20
M = rng.random([n, n])
M = (M + M.T) / 2  # Ensure symmetry

G = inhomogeneous_er_random_graph(n, M, rng)

We draw the resulting graph.

In [80]:
nx.draw(G, with_labels=True,
       node_size=200, font_size=10, font_color='white', node_color='black')
plt.show()

The following subroutine generates an ER graph.

In [81]:
def er_random_graph(n, p, rng):

    # Create the probability matrix M
    M = p * (np.ones((n, n)) - np.eye(n))
    
    # Generate the graph using the inhomogeneous ER random graph function
    G = inhomogeneous_er_random_graph(n, M, rng)
    return G

To confirm our previous calculations, below is the implementation of a routine to estimate the edge density for an ER graph with a fixed parameter $p$. Recall that the edge density is defined as the number of edges present divided by the number of possible edges (i.e., the number of pairs of distinct vertices). The routine takes advantage of the Law of Large Numbers by generating a large number of sample graphs, computing their edge density, and then taking the mean.

In [82]:
def estimate_edge_density(n, p, rng, num_samples=100):

    total_edges = 0
    total_possible_edges = n * (n - 1) / 2
    
    for _ in range(num_samples):
        G = er_random_graph(n, p, rng)
        total_edges += G.number_of_edges()
    
    average_edges = total_edges / num_samples
    edge_density = average_edges / total_possible_edges
    return edge_density

On a small example, we indeed get that the edge density is roughly $p$.

In [83]:
n = 10
p = 0.3
num_samples = 1000

estimated_density = estimate_edge_density(n, p, rng, num_samples)
print(f"Estimated edge density for an ER graph with n={n} and p={p}: {estimated_density}")
Estimated edge density for an ER graph with n=10 and p=0.3: 0.3004888888888889

When $n$, the number of vertices, is large, random graphs tend to exhibit fascinating large-scale emergent behavior. One classical example involves the probability of being connected in an ER graph. To illustrate, below is code to estimate that probability over a range of edge densities $p$.

In [84]:
def estimate_connected_probability(n, p, rng, num_samples=100):

    connected_count = 0
    
    for _ in range(num_samples):
        G = er_random_graph(n, p, rng)
        if nx.is_connected(G):
            connected_count += 1
    
    connected_probability = connected_count / num_samples
    return connected_probability
In [85]:
def plot_connected_probability(n, rng, p_values, num_samples=100):

    probabilities = []
    
    for p in p_values:
        prob = estimate_connected_probability(n, p, rng, num_samples)
        probabilities.append(prob)
    
    plt.figure(figsize=(10, 6))
    plt.plot(p_values, probabilities, marker='o')
    plt.xlabel('p')
    plt.ylabel('Estimated probability of being connected')
    plt.title(f'Estimated Probability of Being Connected for ER Graph with n={n}')
    plt.grid(True)
    plt.show()
In [86]:
n = 100  # Number of vertices
p_values = np.linspace(0, 0.1, 50)  # Range of p values
num_samples = 250  # Number of samples to average over
plot_connected_probability(n, rng, p_values, num_samples)

Above, we ran the code for n equal to $100$. What do you observe? The probability of being connected starts out at $0$ when $p$ is small, which is understandable since it implies that the graph has a relatively small number of edges. But then that probability increases rapidly to $1$ as $p$ crosses a threshold. This is referred to as the phase transition of the ER graph.

It can be shown rigorously that the transition occurs at roughly $p = \log n/n$. That is:

In [87]:
np.log(n)/n
Out[87]:
0.04605170185988092

which is consistent with the plot above.
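As a quick sketch probing both sides of the threshold, we can estimate the connectivity probability at half and at twice $\log n / n$. For speed and self-containedness, this uses NetworkX's built-in gnp_random_graph sampler rather than our er_random_graph (the two are equivalent in distribution).

```python
import networkx as nx
import numpy as np

n = 100
threshold = np.log(n) / n  # ~ 0.046

def connected_fraction(n, p, num_samples=200, seed=535):
    # Fraction of sampled ER graphs that are connected
    count = 0
    for i in range(num_samples):
        G = nx.gnp_random_graph(n, p, seed=seed + i)
        count += nx.is_connected(G)
    return count / num_samples

low = connected_fraction(n, threshold / 2)
high = connected_fraction(n, 2 * threshold)
print(low, high)  # low is close to 0, high is close to 1
```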

NUMERICAL CORNER: We implement the stochastic block model (SBM). We use blocks numbered $0$ and $1$.

In [88]:
def sbm_random_graph(n, block_assignments, B, rng):

    # Create the block assignment matrix Z
    num_blocks = B.shape[0]
    Z = np.zeros((n, num_blocks))
    for i in range(n):
        Z[i, block_assignments[i]] = 1
    
    # Compute the probability matrix M
    M = Z @ B @ Z.T
    
    # Generate the graph using the inhomogeneous ER random graph function
    G = inhomogeneous_er_random_graph(n, M, rng)
    
    return G

Here is an example usage. We first pick a block assignment at random. Specifically, blocks are assigned randomly with numpy.random.Generator.choice. It produces two blocks by assigning each vertex with equal probability to either block, independently of all other choices.

In [89]:
n = 50
block_assignments = rng.choice(2, n)  # Randomly assign vertices to two blocks
B = np.array([
    [0.8, 0.1],
    [0.1, 0.8]
])

G = sbm_random_graph(n, block_assignments, B, rng)

We draw the graph with colored nodes based on block assignments. The "good" cut is clearly visible.

In [90]:
pos = nx.spring_layout(G)
nx.draw(G, pos, with_labels=True, node_color=block_assignments, cmap=plt.cm.rainbow,
       node_size=200, font_size=10, font_color='white')
plt.show()

We introduce a subroutine which assigns blocks at random as follows. Let $\beta_1, \beta_2 \in [0,1]$ with $\beta_1 + \beta_2 = 1$ be the probabilities that a vertex belongs respectively to block $1$ and block $2$. We collect these probabilities in the vector

$$ \boldsymbol{\beta} = (\beta_1, \beta_2). $$

We pick the block $z(i) \in \{1,2\}$ of each vertex $1 \leq i \leq n$ according to the distribution $\boldsymbol{\beta}$, independently of all other vertices.

In [91]:
def generate_block_assignments(n, beta, rng):

    block_assignments = rng.choice(len(beta), size=n, p=beta)
    
    return block_assignments

Here is an example usage.

In [92]:
n = 50  # number of vertices
beta = [0.33, 0.67]  # block probabilities
B = np.array([[0.5, 0.03], [0.03, 0.4]])  # block connection probabilities

block_assignments = generate_block_assignments(n, beta, rng)
G = sbm_random_graph(n, block_assignments, B, rng)

Observe that the blocks are more unbalanced this time.

In [93]:
pos = nx.spring_layout(G)
nx.draw(G, pos, with_labels=True, node_color=block_assignments, cmap=plt.cm.rainbow,
       node_size=200, font_size=10, font_color='white')
plt.show()

To test our spectral partitioning algorithm, we run spectral_cut2, which indeed recovers the ground truth.

In [94]:
A = nx.adjacency_matrix(G).toarray()
s, sc = spectral_cut2(A)
viz_cut(G, s, pos)

The following code computes the fraction of incorrectly assigned vertices. Note that it considers both assignments obtained by swapping the labels $0$ and $1$, since the labeling itself cannot be inferred from the graph.

In [95]:
def calculate_incorrect_fraction(block_assignments, inferred_s, inferred_sc):
    
    n = len(block_assignments)
    
    # Create inferred block assignments
    inferred_assignments = np.zeros(n)
    for i in inferred_s:
        inferred_assignments[i] = 0
    for i in inferred_sc:
        inferred_assignments[i] = 1
    
    # Fraction incorrect as-is and with the labels 0 and 1 swapped
    incorrect_assignments_1 = np.sum(block_assignments != inferred_assignments) / n
    incorrect_assignments_2 = np.sum(block_assignments == inferred_assignments) / n

    # Return the best (minimum) fraction
    return np.minimum(incorrect_assignments_1, incorrect_assignments_2)

We confirm on our previous example that the ground truth was perfectly recovered.

In [96]:
fraction_incorrect = calculate_incorrect_fraction(block_assignments, s, sc)
print(f"Fraction of incorrectly assigned vertices: {fraction_incorrect}")
Fraction of incorrectly assigned vertices: 0.0

One expects that the ground truth is harder to recover if the probability of an edge between blocks is close to that within blocks, which makes the community structure more murky. To test this hypothesis, we modify our previous example by significantly increasing the inter-block probability. This time the blocks are much harder to tease apart through visual inspection.

In [97]:
n = 100  # number of vertices
beta = [0.55, 0.45]  # block probabilities
B = np.array([[0.55, 0.25], [0.25, 0.45]])  # block connection probabilities

block_assignments = generate_block_assignments(n, beta, rng)
G = sbm_random_graph(n, block_assignments, B, rng)

We run spectral_cut2. It recovers the ground truth only partially this time.

In [98]:
A = nx.adjacency_matrix(G).toarray()
s, sc = spectral_cut2(A)
fraction_incorrect = calculate_incorrect_fraction(block_assignments, s, sc)
print(f"Fraction of incorrectly assigned vertices: {fraction_incorrect}")
Fraction of incorrectly assigned vertices: 0.22

NUMERICAL CORNER: We first construct the block assignment and the matrix $M$ in the case of the symmetric stochastic block model (SSBM).

In [99]:
def build_ssbm(n, p, q):

    # Check if n is even
    if n % 2 != 0:
        raise ValueError("Number of vertices n must be even.")
    
    # Create block assignments
    block_assignments = np.zeros(n, dtype=int)
    block_assignments[n//2:] = 1  # Assign the second half of vertices to block 1
    
    # Create the all-one matrix J
    J = np.ones((n//2, n//2))
    
    # Construct the block matrix M
    M_top = np.hstack((p * J, q * J))
    M_bottom = np.hstack((q * J, p * J))
    M = np.vstack((M_top, M_bottom))
    
    return block_assignments, M

Here is an example.

In [100]:
n = 10  # Number of vertices (must be even)
p = 0.8  # Intra-block connection probability
q = 0.2  # Inter-block connection probability

block_assignments, M = build_ssbm(n, p, q)
print("Block Assignments:", block_assignments)
print("Matrix M:\n", M)
Block Assignments: [0 0 0 0 0 1 1 1 1 1]
Matrix M:
 [[0.8 0.8 0.8 0.8 0.8 0.2 0.2 0.2 0.2 0.2]
 [0.8 0.8 0.8 0.8 0.8 0.2 0.2 0.2 0.2 0.2]
 [0.8 0.8 0.8 0.8 0.8 0.2 0.2 0.2 0.2 0.2]
 [0.8 0.8 0.8 0.8 0.8 0.2 0.2 0.2 0.2 0.2]
 [0.8 0.8 0.8 0.8 0.8 0.2 0.2 0.2 0.2 0.2]
 [0.2 0.2 0.2 0.2 0.2 0.8 0.8 0.8 0.8 0.8]
 [0.2 0.2 0.2 0.2 0.2 0.8 0.8 0.8 0.8 0.8]
 [0.2 0.2 0.2 0.2 0.2 0.8 0.8 0.8 0.8 0.8]
 [0.2 0.2 0.2 0.2 0.2 0.8 0.8 0.8 0.8 0.8]
 [0.2 0.2 0.2 0.2 0.2 0.8 0.8 0.8 0.8 0.8]]
In [101]:
G = inhomogeneous_er_random_graph(n, M, rng)

The eigenvectors and eigenvalues of the Laplacian in this case are:

In [102]:
L = nx.laplacian_matrix(G).toarray()
eigenvalues, eigenvectors = LA.eigh(L)
In [103]:
print("Laplacian Matrix:\n", L)
Laplacian Matrix:
 [[ 5 -1 -1 -1  0  0 -1  0 -1  0]
 [-1  4 -1 -1 -1  0  0  0  0  0]
 [-1 -1  6 -1 -1  0  0 -1 -1  0]
 [-1 -1 -1  3  0  0  0  0  0  0]
 [ 0 -1 -1  0  4  0 -1  0  0 -1]
 [ 0  0  0  0  0  4 -1 -1 -1 -1]
 [-1  0  0  0 -1 -1  6 -1 -1 -1]
 [ 0  0 -1  0  0 -1 -1  3  0  0]
 [-1  0 -1  0  0 -1 -1  0  5 -1]
 [ 0  0  0  0 -1 -1 -1  0 -1  4]]
In [104]:
print("Eigenvalues:\n", eigenvalues)
Eigenvalues:
 [-1.70230221e-15  1.57526930e+00  2.82758054e+00  3.52977544e+00
  4.45245585e+00  5.00000000e+00  5.68697833e+00  6.00000000e+00
  6.97601009e+00  7.95193045e+00]
In [105]:
print("The first two eigenvectors:\n", eigenvectors[:,:2])
The first two eigenvectors:
 [[-0.31622777 -0.22785421]
 [-0.31622777 -0.40837567]
 [-0.31622777 -0.17126723]
 [-0.31622777 -0.56677174]
 [-0.31622777 -0.02430785]
 [-0.31622777  0.40837567]
 [-0.31622777  0.21133116]
 [-0.31622777  0.31475394]
 [-0.31622777  0.15474418]
 [-0.31622777  0.30937174]]

The first eigenvalue is roughly $0$ as expected with an eigenvector which is proportional to the all-one vector. The second eigenvalue is somewhat close to the expected $q n = 0.2 \cdot 10 = 2$ with an eigenvector that has different signs on the two blocks. This is consistent with our prediction.

The eigenvalues exhibit an interesting behavior that is common for random matrices. This is easier to see for larger $n$.

In [106]:
n = 100  # Number of vertices (must be even)
p = 0.8  # Intra-block connection probability
q = 0.2  # Inter-block connection probability
block_assignments, M = build_ssbm(n, p, q)
G = inhomogeneous_er_random_graph(n, M, rng)
L = nx.laplacian_matrix(G).toarray()
eigenvalues, eigenvectors = LA.eigh(L)
In [107]:
plt.figure(figsize=(10, 6))
plt.hist(eigenvalues, bins=30, edgecolor='black')
plt.title('Histogram of Eigenvalues of the Laplacian Matrix (n=100)')
plt.xlabel('Eigenvalue')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()
No description has been provided for this image

The first two eigenvalues are close to $0$ and $0.2 \cdot 100 = 20$ as expected. The rest of the eigenvalues are centered around $ (0.2 + 0.8) 100 /2 = 50$, but they are quite spread out, with a density resembling a half-circle. This is related to Wigner’s semicircular law which plays a key role in random matrix theory.
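These concentration claims can be verified numerically; here is a minimal sketch with plain NumPy (the SBM is re-generated from scratch so the snippet is self-contained, with the same illustrative parameters):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 200, 0.8, 0.2
z = np.repeat([0, 1], n // 2)

# two-block SBM adjacency matrix
P = np.where(z[:, None] == z[None, :], p, q)
A = np.triu((rng.random((n, n)) < P).astype(float), 1)
A = A + A.T

# Laplacian eigenvalues, sorted in increasing order
L = np.diag(A.sum(axis=1)) - A
w = np.linalg.eigvalsh(L)

# smallest eigenvalue is 0, the second is near q*n,
# and the bulk is centered near n*(p+q)/2
print(w[0], w[1], w[2:].mean())
```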

$\newcommand{\bSigma}{\boldsymbol{\Sigma}}$ $\newcommand{\bmu}{\boldsymbol{\mu}}$

More advanced material: Weyl's inequality; weighted variants; image segmentation¶

Image segmentation¶

We give a different, more involved application of the ideas developed in this topic to image segmentation. Let us quote Wikipedia:

In computer vision, image segmentation is the process of partitioning a digital image into multiple segments (sets of pixels, also known as image objects). The goal of segmentation is to simplify and/or change the representation of an image into something that is more meaningful and easier to analyze. Image segmentation is typically used to locate objects and boundaries (lines, curves, etc.) in images. More precisely, image segmentation is the process of assigning a label to every pixel in an image such that pixels with the same label share certain characteristics.

Throughout, we will use the scikit-image library for processing images.

As an example, here is a picture of cell nuclei taken through optical microscopy as part of a medical experiment. It is taken from here. We use the function skimage.io.imread to load the image from a file.

In [109]:
img = io.imread('cell-nuclei.png')
fig, ax = plt.subplots(figsize=(6, 6))
ax.imshow(img)
plt.show()
No description has been provided for this image

Suppose that, as part of this experiment, we have a large number of such images and need to keep track of the cell nuclei in some way (maybe count how many there are, or track them from frame to frame). A natural pre-processing step is to identify the cell nuclei in the image. We use image segmentation for this purpose.

We will come back to the example below. Let us start with some further examples.

We will first work with the following map of Wisconsin regions.

In [110]:
img = io.imread('wisconsin-regions.png')
fig, ax = plt.subplots(figsize=(6, 6))
ax.imshow(img)
plt.show()
No description has been provided for this image

A color image such as this one is encoded as a $3$-dimensional array (or tensor), meaning that it is an array with $3$ indices (unlike matrices which have only two indices).

In [111]:
img.shape
Out[111]:
(709, 652, 3)

The first two indices capture the position of a pixel. The third index captures the RGB color channels. Put differently, each pixel in the image has three numbers (between 0 and 255) attached to it that encode its color.

For instance, at position $(300,400)$ the RGB color is:

In [112]:
img[300,400,:]
Out[112]:
array([111, 172, 232], dtype=uint8)
In [113]:
plt.imshow(np.reshape(img[300,400,:],(1,1,3)))
plt.show()
No description has been provided for this image

To perform image segmentation using the spectral graph theory we have developed, we transform our image into a graph.

The first step is to coarsen the image by creating super-pixels, that is, regions of pixels that are close together and have similar colors. For this purpose, we will use skimage.segmentation.slic, which in essence uses $k$-means clustering on the color space to identify blobs of pixels that are in close proximity and have similar colors. It takes as input the desired number of super-pixels (n_segments), a compactness parameter (compactness) and a smoothing parameter (sigma). The output is a label assignment for each pixel in the form of a $2$-dimensional array.

On the choice of the parameter compactness via scikit-image:

Balances color proximity and space proximity. Higher values give more weight to space proximity, making superpixel shapes more square/cubic. This parameter depends strongly on image contrast and on the shapes of objects in the image. We recommend exploring possible values on a log scale, e.g., 0.01, 0.1, 1, 10, 100, before refining around a chosen value.

The parameter sigma controls the level of blurring applied to the image as a pre-processing step. In practice, experimentation is required to choose good parameters.
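The underlying idea, $k$-means on combined color and position features with compactness weighting the positions, can be sketched on a toy image with plain NumPy. This is a simplified illustration of the principle, not skimage's implementation; the image and the deterministic center initialization are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy image: left half red, right half blue, with a little noise
h, wd = 20, 20
img = np.zeros((h, wd, 3))
img[:, :wd // 2, 0] = 1.0
img[:, wd // 2:, 2] = 1.0
img = img + 0.05 * rng.standard_normal(img.shape)

# feature vector per pixel: color channels plus scaled (row, column) position;
# a small compactness means color dominates the clustering
compactness = 0.01
ys, xs = np.mgrid[0:h, 0:wd]
pos = np.stack([ys, xs], axis=-1).reshape(-1, 2).astype(float)
feats = np.concatenate([img.reshape(-1, 3), compactness * pos], axis=1)

# plain Lloyd iterations for k-means with k=2,
# initialized at the first and last pixel
k = 2
centers = feats[[0, -1]].copy()
for _ in range(20):
    d = ((feats[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d.argmin(axis=1)
    centers = np.array([feats[labels == j].mean(axis=0) for j in range(k)])

labels = labels.reshape(h, wd)  # one "super-pixel" label per pixel
```

On this toy image the two clusters recover the red and blue halves; increasing compactness would instead push the clusters toward spatially compact, square-ish regions.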

In [114]:
labels1 = segmentation.slic(img, 
                            compactness=25, 
                            n_segments=100, 
                            sigma=2., 
                            start_label=0)
print(labels1)
[[ 0  0  0 ...  8  8  8]
 [ 0  0  0 ...  8  8  8]
 [ 0  0  0 ...  8  8  8]
 ...
 [77 77 77 ... 79 79 79]
 [77 77 77 ... 79 79 79]
 [77 77 77 ... 79 79 79]]

A neat way to visualize the super-pixels is to use the function skimage.color.label2rgb which takes as input an image and an array of labels. In the mode kind='avg', it outputs a new image where the color of each pixel is replaced with the average color of its label (that is, the average of the RGB color over all pixels with the same label). As they say, an image is worth a thousand words - let's just see what it does.

In [115]:
out1 = color.label2rgb(labels1, img/255, kind='avg', bg_label=0)
out1.shape
Out[115]:
(709, 652, 3)
In [116]:
fig, ax = plt.subplots(figsize=(6, 6))
ax.imshow(out1)
plt.show()
No description has been provided for this image

Recall that our goal is to turn our original image into a graph. After the first step of creating super-pixels, the second step is to form a graph whose nodes are the super-pixels. Edges are added between adjacent super-pixels and a weight is given to each edge which reflects the difference in mean color between the two.

We use skimage.graph.rag_mean_color. In mode similarity, it uses the following weight formula (quoting the documentation):

The weight between two adjacent regions is exp(-d^2/sigma) where d=|c1−c2|, where c1 and c2 are the mean colors of the two regions. It represents how similar two regions are.

The output, which is known as a region adjacency graph (RAG), is a NetworkX graph and can be manipulated using that package.
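As a quick numeric illustration of the weight formula (note that sigma here is the similarity parameter of rag_mean_color, not the SLIC smoothing parameter; the default value 255.0 below is an assumption matching the 8-bit color scale):

```python
import numpy as np

def rag_weight(c1, c2, sigma=255.0):
    """Similarity weight exp(-d^2/sigma) between two mean colors
    (illustrative helper, not part of scikit-image)."""
    d = np.linalg.norm(np.asarray(c1, float) - np.asarray(c2, float))
    return np.exp(-d**2 / sigma)

print(rag_weight([111, 172, 232], [111, 172, 232]))  # identical colors -> 1.0
print(rag_weight([255, 0, 0], [0, 0, 255]))          # very different -> near 0
```

Identical mean colors give weight $1$, while very different colors give a weight near $0$, so heavy edges connect regions that look alike.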

In [117]:
g = graph.rag_mean_color(img, labels1, mode='similarity')
nx.draw(g, pos=nx.spring_layout(g))
No description has been provided for this image

scikit-image also provides a more effective way of visualizing a RAG, using the function skimage.graph.show_rag. Here the graph is super-imposed on the image and the edge weights are depicted by their color.

In [118]:
fig, ax = plt.subplots(nrows=1, figsize=(6, 8))
lc = graph.show_rag(labels1, g, img, ax=ax)
fig.colorbar(lc, fraction=0.05, ax=ax)
plt.show()
No description has been provided for this image

We can apply the spectral clustering techniques we have developed in this chapter. Next we compute a spectral decomposition of the weighted Laplacian and plot the eigenvalues.

In [119]:
L = nx.laplacian_matrix(g).toarray()
print(L)
[[ 9.93627574e-01 -9.93627545e-01  0.00000000e+00 ...  0.00000000e+00
   0.00000000e+00  0.00000000e+00]
 [-9.93627545e-01  1.98331432e+00 -9.89686777e-01 ...  0.00000000e+00
   0.00000000e+00  0.00000000e+00]
 [ 0.00000000e+00 -9.89686777e-01  1.72084919e+00 ...  0.00000000e+00
   0.00000000e+00  0.00000000e+00]
 ...
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  4.03242708e-05
   0.00000000e+00  0.00000000e+00]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
   7.93893423e-01  0.00000000e+00]
 [ 0.00000000e+00  0.00000000e+00  0.00000000e+00 ...  0.00000000e+00
   0.00000000e+00  1.99992197e+00]]
In [120]:
w, v = LA.eigh(L)
plt.plot(np.sort(w))
plt.show()
No description has been provided for this image

From the theory, the number of near-zero eigenvalues suggests that there are roughly $15$ loosely connected clusters in this graph. We project to $15$ dimensions and apply $k$-means clustering (here with $10$ clusters) to find segments. Rather than using our own implementation, we use sklearn.cluster.KMeans from the scikit-learn library. That implementation uses the $k$-means$++$ initialization, which is particularly effective in practice. A label assignment for each node can be accessed using labels_.

In [121]:
ndims = 15 # number of dimensions to project to
nsegs = 10 # number of segments

top = np.argsort(w)[1:ndims]
topvecs = v[:,top]
topvals = w[top]

kmeans = KMeans(n_clusters=nsegs, random_state=12345).fit(topvecs)
assign_seg = kmeans.labels_
print(assign_seg)
[0 0 0 0 0 0 0 0 0 5 2 0 0 5 0 0 0 2 7 0 2 4 2 5 0 2 0 2 5 2 2 5 5 6 0 5 2
 2 5 6 3 5 6 6 5 5 2 5 6 3 0 1 5 6 3 6 5 8 1 6 1 3 3 6 8 8 8 1 1 0 1 0 3 8
 8 8 1 1 9 3 1]

To visualize the segmentation, we assign to each segment (i.e., collection of super-pixels) a random color. This can be done using skimage.color.label2rgb again, this time in mode kind='overlay'. First, we assign to each pixel from the original image its label under this clustering. Recall that labels1 assigns to each pixel its super-pixel (represented by a node of the RAG), so that applying assign_seg element-wise to labels1 results in assigning a cluster to each pixel. In code:

In [122]:
labels2 = assign_seg[labels1]
print(labels2)
[[0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [1 1 1 ... 3 3 3]
 [1 1 1 ... 3 3 3]
 [1 1 1 ... 3 3 3]]
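
The line assign_seg[labels1] relies on NumPy advanced indexing: indexing a $1$-dimensional array with a $2$-dimensional integer array looks up each entry and returns an array of the same shape as the index array. A toy example (with made-up names and values, to avoid clobbering the notebook's variables):

```python
import numpy as np

# cluster assignment for 4 hypothetical super-pixels (indices 0 through 3)
toy_assign = np.array([9, 8, 7, 6])

# super-pixel label of each pixel in a tiny 2x3 "image"
toy_labels = np.array([[0, 0, 1],
                       [2, 3, 3]])

# advanced indexing: each pixel gets the cluster of its super-pixel
toy_result = toy_assign[toy_labels]
print(toy_result)  # [[9 9 8]
                   #  [7 6 6]]
```
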
In [123]:
out2 = color.label2rgb(labels2, kind='overlay', bg_label=0)
out2.shape
Out[123]:
(709, 652, 3)
In [124]:
fig, ax = plt.subplots(nrows=2, sharex=True, sharey=True, figsize=(16, 8))
ax[0].imshow(img)
ax[1].imshow(out2)
plt.show()
No description has been provided for this image

As you can see, the result is reasonable but far from perfect.

In [125]:
def imgseg_rag(img, compactness=30, n_spixels=400, sigma=0., figsize=(10,10)):
    labels1 = segmentation.slic(img, 
                                compactness=compactness, 
                                n_segments=n_spixels, 
                                sigma=sigma, 
                                start_label=0)
    out1 = color.label2rgb(labels1, img/255, kind='avg', bg_label=0)
    g = graph.rag_mean_color(img, labels1, mode='similarity')
    fig, ax = plt.subplots(figsize=figsize)
    ax.imshow(out1)
    plt.show()
    return labels1, g
In [126]:
def imgseg_eig(g):
    L = nx.laplacian_matrix(g).toarray()
    w, v = LA.eigh(L)
    plt.plot(np.sort(w))
    plt.show()
    return w,v
In [127]:
def imgseg_labels(w, v, n_dims=10, n_segments=5, random_state=0):
    # project onto the eigenvectors with the smallest nonzero eigenvalues
    top = np.argsort(w)[1:n_dims]
    topvecs = v[:,top]
    topvals = w[top]
    kmeans = KMeans(n_clusters=n_segments, 
                    random_state=random_state).fit(topvecs)
    assign_seg = kmeans.labels_
    # note: labels1 (the super-pixel labels) is taken from the global scope
    labels2 = assign_seg[labels1]
    return labels2
In [128]:
def imgseg_viz(img, labels2, figsize=(20,10)):
    out2 = color.label2rgb(labels2, kind='overlay', bg_label=0)
    fig, ax = plt.subplots(nrows=2, sharex=True, sharey=True, figsize=figsize)
    ax[0].imshow(img)
    ax[1].imshow(out2)
    plt.show()

Let's try a more complicated image. This one is taken from here.

In [129]:
img = io.imread('two-badgers.jpg')
fig, ax = plt.subplots(figsize=(10, 10))
ax.imshow(img)
plt.show()
No description has been provided for this image

Recall that the choice of parameters requires significant experimentation.

In [130]:
labels1, g = imgseg_rag(img, 
                        compactness=30, 
                        n_spixels=1000, 
                        sigma=0., 
                        figsize=(10,10))
No description has been provided for this image
In [131]:
w, v = imgseg_eig(g)
No description has been provided for this image
In [132]:
labels2 = imgseg_labels(w, v, n_dims=60, n_segments=50, random_state=535)
imgseg_viz(img,labels2,figsize=(20,10))
No description has been provided for this image

Finally, we return to our medical example. We first reload the image and find super-pixels.

In [133]:
img = io.imread('cell-nuclei.png')
labels1, g = imgseg_rag(img,compactness=40,n_spixels=300,sigma=0.1,figsize=(6,6))
No description has been provided for this image

We then form the weighted Laplacian and plot its eigenvalues. This time, about $40$ dimensions seem appropriate.

In [134]:
w, v = imgseg_eig(g)
No description has been provided for this image
In [135]:
labels2 = imgseg_labels(w, v, n_dims=40, n_segments=30, random_state=535)
imgseg_viz(img,labels2,figsize=(20,10))
No description has been provided for this image

This method is quite finicky. The choice of parameters affects the results significantly. You should experiment for yourself.

We mention that scikit-image has an implementation of a closely related method, Normalized Cut, skimage.graph.cut_normalized. Rather than performing $k$-means after projection, it recursively performs $2$-way cuts on the RAG and the resulting subgraphs.
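
The recursive strategy can be sketched as follows. This is a simplified, unnormalized version that splits by the sign of the Fiedler vector and stops when the second eigenvalue is large relative to the subgraph size; it is not skimage's actual implementation, which uses the normalized cut criterion and a user-specified threshold.

```python
import numpy as np

def recursive_bipartition(A, nodes=None, min_size=2, thresh=0.1):
    """Recursively split a graph by the sign of the Fiedler vector
    (illustrative sketch, not skimage's cut_normalized)."""
    if nodes is None:
        nodes = np.arange(A.shape[0])
    if len(nodes) <= min_size:
        return [nodes]
    sub = A[np.ix_(nodes, nodes)]
    L = np.diag(sub.sum(axis=1)) - sub
    w, v = np.linalg.eigh(L)
    if w[1] > thresh * len(nodes):  # well-connected subgraph: stop splitting
        return [nodes]
    left, right = nodes[v[:, 1] < 0], nodes[v[:, 1] >= 0]
    if len(left) == 0 or len(right) == 0:
        return [nodes]
    return (recursive_bipartition(A, left, min_size, thresh)
            + recursive_bipartition(A, right, min_size, thresh))

# two 4-cliques joined by a single bridge edge
A = np.zeros((8, 8))
A[:4, :4] = 1
A[4:, 4:] = 1
np.fill_diagonal(A, 0)
A[3, 4] = A[4, 3] = 1

parts = recursive_bipartition(A)
```

On this example the first cut separates the two cliques, and each clique is then left intact because its second Laplacian eigenvalue is large.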

We try it next. The results are similar, as you can see.

In [136]:
labels2 = graph.cut_normalized(labels1, g)
imgseg_viz(img,labels2,figsize=(20,10))
No description has been provided for this image